{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "!pip install autokeras\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## A Simple Example\n",
    "The first step is to prepare your data. Here we use the [IMDB\n",
    "dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) as\n",
    "an example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from tensorflow.keras.datasets import imdb\n",
    "from sklearn.datasets import load_files\n",
    "\n",
    "dataset = tf.keras.utils.get_file(\n",
    "    fname=\"aclImdb.tar.gz\",\n",
    "    origin=\"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\",\n",
    "    extract=True,\n",
    ")\n",
    "\n",
    "# set path to dataset\n",
    "IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')\n",
    "\n",
    "classes = ['pos', 'neg']\n",
    "train_data = load_files(os.path.join(IMDB_DATADIR, 'train'), shuffle=True, categories=classes)\n",
    "test_data = load_files(os.path.join(IMDB_DATADIR,  'test'), shuffle=False, categories=classes)\n",
    "\n",
    "x_train = np.array(train_data.data)\n",
    "y_train = np.array(train_data.target)\n",
    "x_test = np.array(test_data.data)\n",
    "y_test = np.array(test_data.target)\n",
    "\n",
    "print(x_train.shape)  # (25000,)\n",
    "print(y_train.shape)  # (25000, 1)\n",
    "print(x_train[0][:50])  # this film was just brilliant casting\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The second step is to run the [TextClassifier](/text_classifier).\n",
    "As a quick demo, we set epochs to 2.\n",
    "You can also leave the epochs unspecified for an adaptive number of epochs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import autokeras as ak\n",
    "\n",
    "# Initialize the text classifier.\n",
    "clf = ak.TextClassifier(\n",
    "    overwrite=True,\n",
    "    max_trials=1)  # It only tries 1 model as a quick demo.\n",
    "# Feed the text classifier with training data.\n",
    "clf.fit(x_train, y_train, epochs=2)\n",
    "# Predict with the best model.\n",
    "predicted_y = clf.predict(x_test)\n",
    "# Evaluate the best model with testing data.\n",
    "print(clf.evaluate(x_test, y_test))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Validation Data\n",
    "By default, AutoKeras use the last 20% of training data as validation data.\n",
    "As shown in the example below, you can use `validation_split` to specify the percentage.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "clf.fit(x_train,\n",
    "        y_train,\n",
    "        # Split the training data and use the last 15% as validation data.\n",
    "        validation_split=0.15)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "You can also use your own validation set\n",
    "instead of splitting it from the training data with `validation_data`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "split = 5000\n",
    "x_val = x_train[split:]\n",
    "y_val = y_train[split:]\n",
    "x_train = x_train[:split]\n",
    "y_train = y_train[:split]\n",
    "clf.fit(x_train,\n",
    "        y_train,\n",
    "        epochs=2,\n",
    "        # Use your own validation set.\n",
    "        validation_data=(x_val, y_val))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Customized Search Space\n",
    "For advanced users, you may customize your search space by using\n",
    "[AutoModel](/auto_model/#automodel-class) instead of\n",
    "[TextClassifier](/text_classifier). You can configure the\n",
    "[TextBlock](/block/#textblock-class) for some high-level configurations, e.g., `vectorizer`\n",
    "for the type of text vectorization method to use.  You can use 'sequence', which uses\n",
    "[TextToInteSequence](/block/#texttointsequence-class) to convert the words to\n",
    "integers and use [Embedding](/block/#embedding-class) for embedding the\n",
    "integer sequences, or you can use 'ngram', which uses\n",
    "[TextToNgramVector](/block/#texttongramvector-class) to vectorize the\n",
    "sentences.  You can also do not specify these arguments, which would leave the\n",
    "different choices to be tuned automatically.  See the following example for detail.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import autokeras as ak\n",
    "\n",
    "input_node = ak.TextInput()\n",
    "output_node = ak.TextBlock(block_type='ngram')(input_node)\n",
    "output_node = ak.ClassificationHead()(output_node)\n",
    "clf = ak.AutoModel(\n",
    "    inputs=input_node,\n",
    "    outputs=output_node,\n",
    "    overwrite=True,\n",
    "    max_trials=1)\n",
    "clf.fit(x_train, y_train, epochs=2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The usage of [AutoModel](/auto_model/#automodel-class) is similar to the\n",
    "[functional API](https://www.tensorflow.org/guide/keras/functional) of Keras.\n",
    "Basically, you are building a graph, whose edges are blocks and the nodes are intermediate outputs of blocks.\n",
    "To add an edge from `input_node` to `output_node` with\n",
    "`output_node = ak.[some_block]([block_args])(input_node)`.\n",
    "\n",
    "You can even also use more fine grained blocks to customize the search space even\n",
    "further. See the following example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import autokeras as ak\n",
    "\n",
    "input_node = ak.TextInput()\n",
    "output_node = ak.TextToIntSequence()(input_node)\n",
    "output_node = ak.Embedding()(output_node)\n",
    "# Use separable Conv layers in Keras.\n",
    "output_node = ak.ConvBlock(separable=True)(output_node)\n",
    "output_node = ak.ClassificationHead()(output_node)\n",
    "clf = ak.AutoModel(\n",
    "    inputs=input_node,\n",
    "    outputs=output_node,\n",
    "    overwrite=True,\n",
    "    max_trials=1)\n",
    "clf.fit(x_train, y_train, epochs=2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Data Format\n",
    "The AutoKeras TextClassifier is quite flexible for the data format.\n",
    "\n",
    "For the text, the input data should be one-dimensional \n",
    "For the classification labels, AutoKeras accepts both plain labels, i.e. strings or\n",
    "integers, and one-hot encoded encoded labels, i.e. vectors of 0s and 1s.\n",
    "\n",
    "We also support using [tf.data.Dataset](\n",
    "https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable) format for\n",
    "the training data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "train_set = tf.data.Dataset.from_tensor_slices(((x_train, ), (y_train, ))).batch(32)\n",
    "test_set = tf.data.Dataset.from_tensor_slices(((x_test, ), (y_test, ))).batch(32)\n",
    "\n",
    "clf = ak.TextClassifier(\n",
    "    overwrite=True,\n",
    "    max_trials=2)\n",
    "# Feed the tensorflow Dataset to the classifier.\n",
    "clf.fit(train_set, epochs=2)\n",
    "# Predict with the best model.\n",
    "predicted_y = clf.predict(test_set)\n",
    "# Evaluate the best model with testing data.\n",
    "print(clf.evaluate(test_set))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Reference\n",
    "[TextClassifier](/text_classifier),\n",
    "[AutoModel](/auto_model/#automodel-class),\n",
    "[TextBlock](/block/#textblock-class),\n",
    "[TextToInteSequence](/block/#texttointsequence-class),\n",
    "[Embedding](/block/#embedding-class),\n",
    "[TextToNgramVector](/block/#texttongramvector-class),\n",
    "[ConvBlock](/block/#convblock-class),\n",
    "[TextInput](/node/#textinput-class),\n",
    "[ClassificationHead](/block/#classificationhead-class).\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "text_classification",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}